
KEP: Pod Scheduling Readiness #3522

Merged

merged 9 commits into kubernetes:master on Oct 6, 2022

Conversation

Huang-Wei
Member

  • One-line PR description: Initial KEP writeup for Pod Schedule Readiness
  • Other comments:

/sig scheduling

@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Sep 16, 2022
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Sep 16, 2022
@Huang-Wei Huang-Wei changed the title from "First version of KEP: Pod Schedule Readiness" to "KEP: Pod Schedule Readiness" Sep 16, 2022

Member

@ddebroy ddebroy left a comment

This is really useful! A few comments/clarifications.

Comment on lines 265 to 266
As an advanced scheduler developer, I want to compose a series of scheduler PreEnqueue plugins
to guard the schedulability of my Pods.

Member

Can we list other real cases which also benefit from this extension point?
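
The PreEnqueue extension point referenced in this story was still a proposal at this point in the review. As a rough illustration only, a gate-checking plugin could look like the sketch below, assuming a PreEnqueuePlugin interface in the scheduler framework and the named-gate list the KEP eventually settled on (`spec.schedulingGates`); package and plugin names are illustrative.

```go
// Sketch of a PreEnqueue plugin that keeps gated Pods out of the active queue.
// Assumes the PreEnqueuePlugin extension point and Pod.Spec.SchedulingGates
// exist as proposed in this KEP; names here are illustrative.
package schedulinggates

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

type SchedulingGates struct{}

var _ framework.PreEnqueuePlugin = &SchedulingGates{}

func (pl *SchedulingGates) Name() string { return "ExampleSchedulingGates" }

// PreEnqueue runs before a Pod is moved to the active scheduling queue. If any
// gate is still present, the Pod stays out of the queue until it is removed.
func (pl *SchedulingGates) PreEnqueue(ctx context.Context, p *v1.Pod) *framework.Status {
	if len(p.Spec.SchedulingGates) == 0 {
		return nil // Success: the Pod may be enqueued for scheduling.
	}
	return framework.NewStatus(framework.UnschedulableAndUnresolvable,
		"waiting for scheduling gates to be removed")
}
```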

nitty-gritty.
-->

We propose a new field `.spec.schedulePaused` to the Pod API. The field is defaulted to `false`.

Member

I was thinking (and I suggested it to @ahg-g offline just now) that it might be worth making this a list of strings, in a similar fashion to finalizers.

Once the list is empty, the pod is allowed to be scheduled.

This allows multiple controllers/webhooks to decide whether a pod can be scheduled based on different criteria.

It might be useful to have an endpoint just for this field (and disallow editing via general patch/update) so that there are fewer chances that a controller accidentally removes all the strings. Or maybe we can just have a simple update validation: only one string can be removed at a time.

Although, for finalizers we don't do that. The only related thing is an RBAC rule that allows controllers to just modify finalizers but nothing else.
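
To make the finalizers-like idea concrete, each controller would own exactly one entry and clear only that entry once its criterion is met, roughly as in the sketch below. To keep the types compilable it is written against the named-gate list this KEP eventually settled on (`spec.schedulingGates`); the gate name and helper are hypothetical.

```go
// Sketch: a controller clears only its own gate once its readiness criterion
// is met; when no gates remain, the Pod becomes eligible for scheduling.
// The gate name passed in (e.g. "example.com/quota-check") is hypothetical.
package gating

import (
	"context"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func removeOwnGate(ctx context.Context, c kubernetes.Interface, pod *v1.Pod, ownGate string) error {
	kept := make([]v1.PodSchedulingGate, 0, len(pod.Spec.SchedulingGates))
	for _, g := range pod.Spec.SchedulingGates {
		if g.Name != ownGate { // keep gates owned by other controllers untouched
			kept = append(kept, g)
		}
	}
	updated := pod.DeepCopy()
	updated.Spec.SchedulingGates = kept
	_, err := c.CoreV1().Pods(pod.Namespace).Update(ctx, updated, metav1.UpdateOptions{})
	return err
}
```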

Member

I think it is a good idea, and I am wondering if we have any "regrets" related to the finalizers pattern that we need to consider when evaluating this API format, @liggitt @smarterclayton any thoughts?

Member

@pwschuurman this should help simplify #3336

Member Author

Thanks @alculquicondor for this idea. This looks more extensible, and we may name it `ScheduleReadinessGates []string`.

I'm interested to learn from API experts if there are any traps in adopting this finalizers-like API pattern.

Member

It's great - then each controller can have its own enqueue plugin if it needs one.

Member Author

We could also have creation default the pod Scheduled condition to false and reason WaitingForGate.

That's fair. Just to clarify: the condition "PodScheduled=WaitingForGate" is applied when the Pod goes through the lightweight scheduling process, rather than being defaulted by pkg/apis/core/<something>.

Member

No, we can default it at pod creation.

Member Author

It seems we don't have a precedent for defaulting a condition? (I checked pkg/registry/core/pod/strategy.go)

But if API folks are fine with the pattern, it at least saves the scheduler some effort.

Contributor

We're adding a Kube feature, and defaulting can do many things; while it's novel, it's certainly worth trying. If we come up with a reason to remove it before beta, that's low cost, but we do have defaulting of status (namespace status phase defaulting). I think it's a nice usability win, and we can record "will look for implications of novel condition defaulting" for GA. We may only be able to do the defaulting during creation (we can't do it during update), so that's a possible wrinkle. Could simply list it as a run-at for mitigation.

Member Author

I think it's a nice usability win

+1. Already added it in the monitoring section.
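
For illustration of the creation-time defaulting discussed above, a registry-strategy-style hook could stamp the condition roughly as below; the `WaitingForGate` reason is the name floated in this thread, and the hook placement is an assumption, not the reviewed design.

```go
// Sketch: default PodScheduled=False at Pod creation when scheduling gates are
// present. The reason string and the hook location (a PrepareForCreate-style
// step in the pod registry strategy) are illustrative assumptions.
package podstrategy

import api "k8s.io/kubernetes/pkg/apis/core"

func defaultScheduledCondition(pod *api.Pod) {
	if len(pod.Spec.SchedulingGates) == 0 {
		return
	}
	pod.Status.Conditions = append(pod.Status.Conditions, api.PodCondition{
		Type:    api.PodScheduled,
		Status:  api.ConditionFalse,
		Reason:  "WaitingForGate", // name floated in this thread; the shipped reason may differ
		Message: "Scheduling is blocked until all scheduling gates are removed",
	})
}
```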

@pwschuurman
Contributor

I have a use case, described in #3335, that could alternatively be achieved through this KEP.

A user wants to migrate a StatefulSet across Kubernetes clusters (or namespaces). Since a StatefulSet object is a namespaced resource, in order to maintain at-most-one semantics during migration, once a pod is migrated to the new namespace it shouldn't be able to start in the old namespace. In #3335, the approach is to modify the StatefulSet API to achieve this. An orchestrator that was aware of resources cross-cluster or cross-namespace would be responsible for manipulating which pods should be available to each StatefulSet.

With this KEP, however, the same behavior could be achieved by having the orchestrator set the schedulePaused boolean on a pod as needed (via a mutating webhook that the orchestrator installs), which would prevent a pod from being scheduled in the previous cluster/namespace. By controlling which pods get scheduled in this manner and when, pods could be migrated in an ordered fashion.
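
A rough sketch of the orchestrator-installed mutating webhook described above, written against the named-gate list the KEP ended up with rather than the earlier boolean; the gate name and handler wiring are hypothetical.

```go
// Sketch: a mutating admission webhook response that adds a scheduling gate to
// newly created Pods so an external orchestrator decides when they may schedule.
// The gate name "example.com/statefulset-migration" is hypothetical.
package webhook

import (
	"encoding/json"

	admissionv1 "k8s.io/api/admission/v1"
)

func gatePod(review *admissionv1.AdmissionReview) *admissionv1.AdmissionResponse {
	patch := []map[string]interface{}{{
		"op":    "add",
		"path":  "/spec/schedulingGates",
		"value": []map[string]string{{"name": "example.com/statefulset-migration"}},
	}}
	patchBytes, err := json.Marshal(patch)
	if err != nil {
		return &admissionv1.AdmissionResponse{UID: review.Request.UID, Allowed: false}
	}
	pt := admissionv1.PatchTypeJSONPatch
	return &admissionv1.AdmissionResponse{
		UID:       review.Request.UID,
		Allowed:   true,
		Patch:     patchBytes,
		PatchType: &pt,
	}
}
```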

@Huang-Wei
Member Author

@pwschuurman thanks for sharing the use case. In your migration flow, do you need to manipulate the old, already-scheduled StatefulSet to update its scheduledPaused to true? Or do you just create the new StatefulSet with scheduledPaused=true and migrate one by one?

@pwschuurman
Contributor

@pwschuurman thanks for sharing the use case. In your migration flow, do you need to manipulate the old, already-scheduled StatefulSet to update its scheduledPaused to true? Or do you just create the new StatefulSet with scheduledPaused=true and migrate one by one?

For the migration use case, I think the new StatefulSet can be created with scheduledPaused=true. Only the new pods go through the true->false transition for scheduledPaused. For the old scheduled StatefulSet, .spec.replicas can be scaled down by one (highest ordinal first), and .spec.scheduledPaused=false can be set on the corresponding pod in the new StatefulSet (highest ordinal first).

To support rollback (reverse migration) though, pods will need to go through a false->true transition. As discussed in the KEP, there isn't a strong use case to support direct modification from false->true on an existing pod. To support this scenario for our use case though, the pod can be deleted (and .spec.scheduledPaused=true set through a webhook when the pod gets re-created).

@Huang-Wei
Member Author

To support rollback (reverse migration) though, pods will need to go through a false->true transition.

Good to know this scenario.

there isn't a strong use case to support direct modification from false->true on an existing pod
the pod can be deleted

SG, this fits the current consensus (#3522 (comment)) to prevent the false->true transition when updating an existing Pod (you have to recreate the pod).

@Huang-Wei Huang-Wei force-pushed the kep/ready-to-schedule branch from 949c54e to 27398cd Compare September 20, 2022 23:36
@Huang-Wei Huang-Wei force-pushed the kep/ready-to-schedule branch 2 times, most recently from 77019cf to 4f01ee3 Compare September 23, 2022 01:12
@Huang-Wei
Member Author

@ahg-g @smarterclayton Updated the doc based on the comments:

  • reword schedulePaused to schedulingPaused
  • support setting spec.nodeName without clearing schedulingPaused
  • add a schedulingGates option into an UNRESOLVED section
  • add a story supporting the need for []string instead of a single boolean
  • update the version skew section
  • change the status from provisional to implementable

// This option will need to be accompanied with a yet-to-be-implemented fine-grained
// permission (https://docs.google.com/document/d/11g9nnoRFcOoeNJDUGAWjlKthowEVM3YGrJA3gLzhpf4)
// and consistent with how finalizers/liens work today.
SchedulingGates []string

Member

@smarterclayton it seems that now you are leaning towards this option vs a flag, correct? If so, we also need to discuss the format; we now have two options:

  1. similar to finalizers: actors add to the list and later remove from the same list
  2. similar to readinessGates: actors add to the list in the spec, and later add to a parallel list in status

Regarding option 2, do we have concerns that it will require controllers to have update permissions on pod status? I guess not if we allow updating the list only through a dedicated endpoint, since we want to enable fine-grained permissions.

I like option 2 because it preserves the history of who decided that a pod is ready, which is useful for debugging and tracking feature usage.
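
To make the two candidate formats concrete, rough type sketches of each shape follow; all names here are illustrative, not a settled API.

```go
// Illustrative-only sketches of the two formats under discussion.
package sketch

// Option 1, finalizers-like: a single list in spec. Each actor removes its own
// entry; an empty list means the Pod may be scheduled.
type SpecOption1 struct {
	SchedulingGates []string
}

// Option 2, readinessGates-like: gates are declared in spec at creation and
// each actor later records satisfaction in a parallel status entry, preserving
// a history of who lifted which gate.
type SpecOption2 struct {
	SchedulingGates []SchedulingGate
}

type SchedulingGate struct {
	Name string
}

type StatusOption2 struct {
	SchedulingGateStatuses []SchedulingGateStatus
}

type SchedulingGateStatus struct {
	Name      string
	Satisfied bool
}
```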

Member Author

Personally, I'm in favor of readinessGates as (a) we have pod readiness gates as precedent, and (b) each controller doesn't need to worry about updating the same API field, which keeps factors like ordering out of the picture. However, two aspects are still unresolved (#3522 (comment)):

  • Permission. It's still unclear to me. The pod readiness gates KEP didn't mention permissions, so I really hope @smarterclayton and @liggitt can shed some light - if we think permission is mandatory here, we have to get it right from day 1.
  • Restrictive condition transition. I suppose it's doable to enforce that the condition only transitions from NotReady to Ready, but reposting here for confirmation.

Contributor

@smarterclayton smarterclayton Sep 23, 2022

I need to read the arguments around pod readiness gates and the whole status thing. Scheduling happens before ownership is transitioned to a node, so I would probably argue that, like finalizers, all mutations happen within spec (I view metadata as generic spec in this scenario).

Anyone who can set spec.nodeName is already as powerful as someone who can set the proposed scheduling gates. OpenShift and some others had admission checks that limited who could mutate that field to effectively cluster admins (roughly, mutating that field required an RBAC check: you had to have the binding subresource permission of a scheduler).

Subresource permission is supported today, so adding a subresource is the easiest way, but we'd probably have to have special cases for what characteristic grants permission (i.e. a check on readiness/<name>); still, it's not terribly difficult today to make the subresource pods/readiness/<name>. The more we start bulk updating gates (for instance if the scheduler implements a set of gates via plugins and we want to bulk submit, vs letting individual actors coordinate), the closer we get to just performing field-level checks in an admission handler.

How would we model multiple gates coordinating in a single scheduler? I would very much dislike having multiple individual writes to the object, but anything that happens in a single process and is fairly quick should probably just be a single gate (even if it's implemented as a composite of steps inside the plugin).

Ultimately subresources work well when the use case is narrow and you don't need to subdivide clients excessively - this is starting to cross that line for binding if we expect lots of gates from a single client being set in a row.
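
To illustrate the subresource-based permission idea, an RBAC rule could scope a controller to a dedicated gates subresource instead of the whole Pod, roughly as below; the `pods/schedulinggates` subresource is purely hypothetical here.

```go
// Sketch: an RBAC rule that would let a controller update only a hypothetical
// pods/schedulinggates subresource rather than the whole Pod object.
package rbacsketch

import rbacv1 "k8s.io/api/rbac/v1"

func gateUpdaterRule() rbacv1.PolicyRule {
	return rbacv1.PolicyRule{
		APIGroups: []string{""},                     // core API group
		Resources: []string{"pods/schedulinggates"}, // hypothetical subresource
		Verbs:     []string{"get", "update", "patch"},
	}
}
```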

Member

The more we start bulk updating gates (for instance if the scheduler implements a set of gates via plugins and we want to bulk submit, vs letting individual actors coordinate) the more we get closer to just performing field level checks in an admission handler.

I don't think the scheduler itself will have plugins that update gates; the scheduler will only respect those gates (i.e., if they are present, it will simply not attempt to schedule the pod).

The gates will be updated by external controllers, like a queueing controller or a statefulset migration controller; by design those will likely be controllers that don't share memory and run as separate processes. I think the proper design pattern should be that a controller only handles/updates a single gate.

Member Author

Whether the field is a single boolean or a list of strings, a Pod's creation is the only chance to set the complete gating condition(s). For a single boolean, true represents the condition; for a list of strings, it's a highly customized list. After a Pod's creation, only "semantic subtraction" can be allowed: for a boolean, it's true -> false; for a list of strings, it's "delete a particular element from the list" (this just behaves differently for a finalizer-like list vs. a readiness-gates-like list).

I would very much dislike having multiple individual writes to the object

For the design of a list of strings, both options (finalizer-like and readiness-gate-like) will inevitably yield N individual writes to a Pod - each controller needs to spawn an API call to relax its readiness gate anyway (either by deleting one entry from the finalizer-like list, or updating its particular .status.conditions[x] to ready/true).
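
The "semantic subtraction" rule above could be enforced in update validation roughly as in this sketch (written against the named-gate list shape; the helper name is illustrative).

```go
// Sketch: update validation that only permits removing scheduling gates after
// creation, never adding them. The helper name is illustrative.
package validation

import (
	"fmt"

	api "k8s.io/kubernetes/pkg/apis/core"
)

func validateGatesOnlyRemoved(oldGates, newGates []api.PodSchedulingGate) error {
	existing := make(map[string]bool, len(oldGates))
	for _, g := range oldGates {
		existing[g.Name] = true
	}
	for _, g := range newGates {
		if !existing[g.Name] {
			return fmt.Errorf("scheduling gate %q cannot be added after Pod creation", g.Name)
		}
	}
	return nil
}
```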

Contributor

Hrm. In essence, we have a spectrum:

  1. Initializers / finalizers - []string, removal only, additional data stored on object or at worst in meta, extremely unlikely we add supporting data
  2. schedulingGates - removal only, additional data stored on pod spec, potentially other data in the future but unlikely, users unlikely to need the status of each gate, integrators likely able to edit spec
  3. readinessGates - []struct + conditions, readiness is inherently reactionary and two-way, gates say they are one way but are not (and probably can't safely change conditions), I think it is possible to lose all node state in such a way that gates have to be re-executed (but we don't clearly define that anywhere), integrators SHOULD NOT be able to edit spec, users want to see the status of each gate

The argument I'm making is that nothing requires conditions, so why should they be part of the flow? Consistency is one argument, but I think the use case is closer to initializers / finalizers, and since we don't have a need, I'd argue YAGNI. If we do get feedback that users want a lot of condition state from gates, we could reassess in beta (the structure wouldn't change for our spec).

Thoughts?

Member

I agree with option 2, and I think even now integrators are free to add conditions to the pod if they wish to offer more status information, aren't they? So there is no need to tie that to our feature.

Contributor

Right, good argument.

Member Author

Option 2 should work for most cases. And if some case comes up later, like re-scheduling a pod without deleting it, it's also extensible / backward-compatible to add fields.

Member Author

updated.

@Huang-Wei Huang-Wei force-pushed the kep/ready-to-schedule branch from c82765e to e3b7d3e Compare October 3, 2022 18:08
@ahg-g
Member

ahg-g commented Oct 4, 2022

@smarterclayton @Huang-Wei I guess all unresolved issues are now addressed; I think we settled on the API and the overall semantics. The deadline is in a couple of days :)

@Huang-Wei
Member Author

@ahg-g yup, I think all comments have been resolved. Once @smarterclayton and @wojtek-t give a green light, I will squash the commits.

@smarterclayton
Contributor

Doing a final pass now.

or not. This enables scheduler plugin developers to try out the feature even in Alpha, by crafting
different enqueue plugins and wiring them with custom fields or conditions.

- **New column in kubectl:** To provide better UX, we're going to add a column "scheduling paused"

Contributor

We already have a lot of columns; I just expected this to show up in the phase column (where we summarize the state of the pod). I don't want to specifically add another column unless we have a great reason, and this column will be empty for 99% of pods for 99.999% of their lifecycle. Can we update the text?

Member Author

that's fair. Updated.


type PodSchedulingGate struct {
// Name of the scheduling gate.
Name string

Contributor

Clarify here that name must be unique among gates? You say map earlier, but we don't clearly describe here that this must be unique among other gates.

Member Author

Updated.

@wojtek-t
Member

wojtek-t commented Oct 5, 2022

@Huang-Wei - the PRR will require more for Beta, but it's good enough for Alpha.

/approve PRR

/label tide/merge-method-squash

@k8s-ci-robot k8s-ci-robot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Oct 5, 2022
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Huang-Wei, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 5, 2022
@wojtek-t
Member

wojtek-t commented Oct 5, 2022

@ahg-g @smarterclayton - please LGTM it once you're ready

[fine-grained permissions](https://docs.google.com/document/d/11g9nnoRFcOoeNJDUGAWjlKthowEVM3YGrJA3gLzhpf4))
- Gather feedback from developers and out-of-tree plugins.
- Benchmark tests passed, and there is no performance degradation.
- Update documents to reflect the changes.

Contributor

  • Identify whether gates can be added post-creation

Member Author

updated.

- Feature disabled by default.
- Unit and integration tests completed and passed.
- API strategy test (`pkg/registry/*`) to verify disabled fields when the feature gate is on/off.
- Additional tests are in Testgrid and linked in KEP

Contributor

  • Determine whether any additional state is required per gate

Member Author

updated.

@ahg-g
Member

ahg-g commented Oct 5, 2022

/lgtm
/hold

In case there are any last comments

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Oct 5, 2022
@smarterclayton
Contributor

lgtm - excellent minimal proposal which can unlock extensibility and composition for schedulers and workloads.

@wojtek-t
Member

wojtek-t commented Oct 6, 2022

lgtm - excellent minimal proposal which can unlock extensibility and composition for schedulers and workloads.

+1

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 6, 2022
@k8s-ci-robot k8s-ci-robot merged commit fac005a into kubernetes:master Oct 6, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.26 milestone Oct 6, 2022
@Huang-Wei Huang-Wei deleted the kep/ready-to-schedule branch October 6, 2022 16:08
@Huang-Wei
Member Author

Thanks to all reviewers for your dedicated time and valuable comments! Without them, I couldn't have made this KEP go through.

I particularly appreciate the primary reviewers @ahg-g and @smarterclayton for your continuous, insightful comments that helped shape the final KEP.

@Huang-Wei Huang-Wei changed the title from "KEP: Pod Schedule Readiness" to "KEP: Pod Scheduling Readiness" Oct 6, 2022
ahmedtd pushed a commit to ahmedtd/enhancements that referenced this pull request Feb 2, 2023
* First version of KEP: Pod Schedule Readiness

* address comments including:

- reword `schedulePaused` to `schedulingPaused`
- support set spec.nodeName without clearing schedulingPaused
- add `schedulingGates` into UNRESOLVED
- add a story to support needs of array instead of a single boolean
- update the version skew section
- change the status from `provisional` to `implementable`

* fixup: rewordings

* fixup: major update

* minor: polish wordings

* address PRR comments

* address PRR comments (cont.)

* some rewordings

* address comments